(57) An apparatus and method are provided that support
continuous media for conventional networked
workstations and PC's. Described are user-level mech- 
anisms and policies designed to give good, efficient mul- 
timedia service under the mild assumption that the op- 
erating system provides a preemptive real-time sched- 
uling class that can be used to give CPU cycles to the 
multimedia processes in preference to other processes 
that are not time sensitive. There are no modifications 
to the operating system kernel and isochronous net- 
works are not required. It suffices for an application to 
state that it wants to play a particular stream of a type 
known to the server (e.g. a file containing MPEG-1 video
320x240 pixels, 8-bit color, 30 frames per second), or 
for the application to specify a frame rate and an index 
describing the offset of each frame in a file. 
Description

BACKGROUND OF THE INVENTION 

Field of the Invention 

This invention relates to resource scheduling and, 
more particularly, to dynamic hierarchical resource
scheduling for continuous media based on initial con- 
servative estimates and then slack filling. 

Description of the Related Art 

As loads increase on computers and workstations,
smooth playout of continuous media streams has become
harder to maintain because of the loss of Central
Processing Unit (CPU) cycles and of network and disk
bandwidth, which leads to delayed transfers. Demand has
risen for continuous media service on networks of
conventional workstations and personal computers running
common operating systems that can handle more strenuous
loads. But the normal mechanisms in this environment
provide resource allocation with no allowance for
time-sensitive behavior.

In order for media service to maintain a continuous 
stream of data, resources need to be managed to avoid 
shortages that would cause interruptions. In general,
most of the prior art has been geared toward hard
real-time scheduling. This approach requires an
environment where every latency and resource requirement is
known. See D. Niehaus, et al. The Spring Scheduling 
Co-Processor: Design, Use and Performance; In Pro- 
ceedings Real-Time Systems Symposium Raleigh-Dur- 
ham NC, pgs. 106-111, Dec. 1993; and K. Schwan et 
al., Dynamic Scheduling of Hard Real-Time Tasks and 
Real-Time Threads, IEEE Transactions on Software En- 
gineering, 18(8): 736-748, Aug. 1992. Hard real-time 
scheduling is computationally expensive, and is not nor- 
mally possible in general timesharing environments 
characterized by unpredictable resources and delays. 

A second common approach has been with con- 
servative resource usage estimates. Admission control 
limits the service load to that known to be safe in the 
worst case, or known to be safe with high probability. 
See, for example, D.P. Anderson, Metascheduling for 
continuous media. ACM Transactions on Computer 
Systems, 11 (3): 226-252, Aug. 1993 and D.P. Anderson 
et al., A File System for Continuous Media, ACM
Transactions on Computer Systems, 10(4): 311-337, 1992.
However, conservative estimates lead to low utilization 
in a general operating system setting. 

Admission control and resource scheduling can al- 
so be performed by a model of the resources available 
and the resource consumption demanded by a new re- 
quest. As an example, see P. Loucher et al.; The Design 
Of A Storage Server For Continuous Media, The Com- 
puter Journal, 36(1):2-42, 1993. One problem here is 
that a resource model of the system cannot be both
highly accurate and inexpensively computed, particularly
for general timesharing environments. Another problem
is that the precise resource consumption implied by
a new request for service is generally unknown. 

The above approaches generally share the "admission
control assumption", namely that a request for service
should either be refused immediately, or should be
guaranteed good service. To support the admission con- 
trol assumption, the systems must place fairly strong re- 
quirements on the determinism and predictability of the 
system and workload. 

SUMMARY OF THE INVENTION 

The present invention provides a method for handling
continuous media efficiently and with good service
quality on conventional computers and operating systems,
given a real-time scheduling priority such as in Sun's
Solaris or Microsoft's Windows NT. The key elements of
the apparatus and method are to perform hierarchical
resource scheduling based on conservative estimates of
resource consumption; to improve utilization by slack
filling, based on dynamic monitoring of actual resource
consumption; to protect the quality of service for the
maximal number of important streams by focusing
degradation during periods of resource insufficiency on
the streams deemed least important according to some
prioritization policy; and to use application-specified
policies to degrade streams, rather than a common
server-specified policy.

Input/Output (I/O) requests are scheduled by the 
present invention to the operating system on the server 
so as to manage the bandwidth of the disks, disk con- 
trollers, memory subsystem, and network to get the 
needed multimedia frames to the clients in the right time- 
frames. If the network supports bandwidth reservation, 
the network portion of the scheduling reduces to simple 
admission and monitoring. The techniques for resource 
management in the client/server environment are also 
applicable to systems playing continuous media from lo- 
cal disks. The network scheduling is applicable to local 
area network (LAN) environments where the server 
send and client receive operations are closely coupled. 

An alternative to the admission control assumption 
is provided that is suitable for a general application en- 
vironment, such as UNIX workstations and PC's on net- 
works, that may not support bandwidth reservation. A 
hierarchical resource scheduler, based on simple
conservative resource availability and consumption
functions, identifies the initial set of transfers to be issued to
the operating system at the beginning of each schedule 
cycle. A method for dynamic slack determination and 
filling observes actual resource consumption during a 
cycle and dynamically initiates additional transfers to im- 
prove utilization. 

In an environment where the available resources 
cannot be characterized precisely, and the resource de- 
mands implied by client requests for continuous media 
service are not accurately known, the methods of the
present invention give good service to the most
important continuous media jobs, focusing resource
insufficiency on the least important jobs. Importance is
determined by an arbitrary prioritization policy that places
a rank ordering on the jobs. This is the best that can be 
done without stronger assumptions on the predictability 
of the system and workload. 

The focus of the invention is on playback of continuous
media, e.g. audio and video. The recording of continuous
media is not a time-sensitive activity. A server
having sufficient bandwidth to accommodate the stream 
can use buffering and delayed writes to handle record- 
ing of continuous media at its convenience. 

BRIEF DESCRIPTION OF THE DRAWINGS 

The features of the subject invention will become 
more readily apparent and may be better understood by 
referring to the following detailed description of an illus- 
trative embodiment of the present invention, taken in 
conjunction with the accompanying drawings, where: 

Fig. 1 illustrates a general configuration of the 
scheduler in a preferred embodiment of the present 
invention; 

Fig. 2 illustrates a flow of the overall approach of 
the present invention; 

Fig. 3 shows a typical resource tree of the present 
invention; and 

Fig. 4 shows a typical flow of job scheduling in the 
present invention. 

DETAILED DESCRIPTION OF PREFERRED 
EMBODIMENTS 

Fig. 1 illustrates, in a simplified manner, the general 
configuration of the preferred embodiment. Continuous 
media are stored on disks 20 and 22, and portions are 
buffered into memory as they are being accessed by us- 
er applications via the data transfers and connection re- 
quests. When a new job arrives and requests service, 
the job admission resource scheduler/server 26 gener- 
ates one of three responses: 

Trivial admit (The system is very underloaded and 
can provide the requested service); 
Trivial reject (The system is very overloaded and 
service is refused); or 

Conditional admit (Service will be attempted). 

The guarantee given by admission is not uncondi- 
tional good service. It is that the service to an admitted 
stream will not be degraded until all "less important" 
streams have first had their service cut off, where "less 
important" is defined by some importance ranking policy 
with respect to each application. 

The server issues batches of disk transfers through
the disk I/O 24 and network transfer requests through
the network I/O 30 to the operating system to gain the
benefits of disk scheduling and overlaps. This means
the requests will complete in unpredictable orders.
Cyclic scheduling is used to establish time frames that
can be monitored to detect tardiness before jobs actually
miss their deadlines.

For each cycle, the preferred embodiment performs
relatively simple conservative resource modeling to
grant bulk admission to this cycle for the most important
jobs that can safely be predicted to complete on time as
part of the server. During the cycle, the server does
on-line tracking of actual resource consumption, and
dynamically admits additional transfers to this cycle when
possible. This maintains the inexpensive cost of simple
modeling, while achieving good utilization by dynamically
filling slack regained from conservative estimates
by the early completion of transfers.

The server does not actively manage CPU cycles.
It runs in a real-time process class so the continuous
media service can prevent other jobs (that are not as
time-sensitive) from starving the continuous media
streams for CPU cycles. Given this, the system
bottleneck is then normally somewhere in the disk, network,
or memory subsystems.

The preferred embodiment of the present invention
does not depend on a balanced architecture. The
resource scheduling dynamically adjusts to the bottleneck
resource, whether that be a disk, disk controller, network
interface, or the memory subsystem. Consider, for
example, a fast workstation with Ethernet versus one with
a gigabit ATM network.

The network is not required to support bandwidth
reservations, but if they are available, then the system
efficiency will increase because more streams will be
granted bulk admission to a cycle, and fewer will be
served by absorbing slack dynamically freed during the
schedule. In addition, clients obtain better assurance of
high service quality because the hazard of network
bandwidth shortages has been eliminated. For
non-reservation networks like Ethernet, the clients will
experience degradation if the network bandwidth goes away.

Unless the system is totally isochronous, so that it can
be perfectly scheduled, or scheduling is maximally
conservative, there will be overloads. Consider, for
example, variable bit rate compressed video, or scripted
presentations that have peaky bandwidth demands.
Given that the user interface provides VCR functions,
including pause, fast forward, and rewind, a subsequent
user can cause the peaks to coincide in the worst
possible way.

Numerous techniques have been proposed for load
shedding. Some systems serve three classes of clients,
i.e. continuous media, interactive response, and
background; the latter two have their service rate reduced to
save resources for continuous media. Others suppose
that some streams are pre-identified as degradable, i.e.
able to cope with reduction of bandwidth, and their
system degrades these streams first, based on measures
of how loss tolerant each client is, spreading the
degradation proportional to loss tolerance.

Still other proposals for load shedding include the
use of scalable digital coding with service degradation
by reducing resolution, the deferral of processing for
requests that the application has indicated are less
important, and the reduction of bandwidth by reducing
resolution or frame rate.

In actuality, these are policies, so the server has no 
business selecting which of these techniques to apply. 
If a stream is missing its deadlines because of system 
overload, an application-selected policy should be ap- 
plied to that stream. The server just provides the mech- 
anisms. 

A strategy of degrading service to cope with overload
is perceptually significant, because slight degradation
of the streams won't yield significant resource
savings. For example, if the server disk subsystem is the
bottleneck, dropping 10% of the streams will shed 10% 
of the load, but dropping 10% of the frames in all streams 
sheds much less than 10% of the load because disk 
seek overhead becomes relatively higher. Omitting a 
few frames from the middle of a whole track read may 
result in no mitigation of overload whatsoever. 
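
The disk example above can be illustrated with a small
calculation. This is only a sketch under a stated
assumption: the disk time spent on one stream per cycle is
one access (seek plus rotation) followed by one transfer,
and dropping frames shortens only the transfer. The
function name and the numbers are hypothetical.

```python
def load_shed_fraction(t_access, t_transfer, frames_dropped):
    """Fraction of per-stream disk time saved when a fraction of
    frames is dropped, assuming access time is unchanged and
    transfer time shrinks in proportion to the dropped frames."""
    before = t_access + t_transfer
    after = t_access + t_transfer * (1.0 - frames_dropped)
    return (before - after) / before

# Hypothetical disk: 20 ms access and 20 ms transfer per stream per cycle.
saved = load_shed_fraction(0.020, 0.020, 0.10)
print(f"dropping 10% of frames sheds {saved:.1%} of the disk load")
```

With these assumed numbers, dropping 10% of the frames
sheds roughly half as much load as dropping 10% of the
streams. If the dropped frames lie in the middle of a
whole track read, the transfer time does not shrink at all
and the saving is zero.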

The conventional admission control assumption 
leads to policies that accept new clients for service, even 
if this means degrading the quality of service provided 
to existing clients, up to some threshold beyond which 
new clients are refused. By contrast, the admission con- 
trol of the preferred embodiment does not predict the 
degradation that would result from admitting the next cli- 
ent, because clients are admitted subject to the condi- 
tion that any degradation required to handle a resulting 
overload will be focused on the least important streams. 
In particular, the present invention provides a mecha- 
nism to protect a subset of the streams, and uses a pri- 
oritization policy that explicitly chooses which streams 
to protect and which others to degrade. Any policy can 
be used that gives a rank ordering over all the streams. 
A simple policy that appears useful for the server is to 
protect the oldest streams, degrading the newest. Thus 
streams that have been running for a long time will tend 
to be protected from degradation, while streams that re- 
cently started may be interrupted if it turns out that there 
really isn't sufficient capacity to serve them. A new client
who requests service can quickly see whether the qual- 
ity of service is satisfactory and decide whether to con- 
tinue or quit. Thus resource insufficiency is focused on 
unsupportable newly added load, stimulating client driv- 
en load shedding. Furthermore, a client can state how 
it would like to be degraded, e.g. lower frame rate or
resolution, if it would otherwise be suspended because of
resource insufficiency. Note that this has been a discus- 
sion of load shedding policy, i.e. who to degrade during 
overload, not how conservative to make admission con- 
trol to set the tradeoff between system utilization and 
probability of degradation. 



Consider composite streams such as video with audio,
or scripted presentations consisting of a time-related
collection of continuous media streams and discrete
data objects such as images that will be synchronized
at the client for joint presentation. From a human factors
standpoint, loss of audio is a much greater failure than
loss of video. Therefore, scheduling should consider
more than a frame's intended presentation time. In
systems that can't provide hard real-time guarantees, the
server will send small, valuable frames well ahead of
time, to be buffered on the client side. This can be hidden
in the client-side libraries that handle composite
audio-video streams by pre-fetching audio data to the client
buffer far in advance. Transfer scheduling to the client is
distinct from presentation scheduling at the client.

The server of the preferred embodiment supports
applications that trade aggressive degradation of some
streams in exchange for better protection of others.
Applications can specify tradeoffs in the prioritization
policies of the component streams of a composite or
scripted job. A newly arriving job can boost the priority of
some components while antiboosting the priority of others,
provided (a) that the bandwidth-weighted average
priority does not increase and (b) that it doesn't injure
established streams unless these streams have antiboost
priority. This way, a stream is only permitted to boost if it
does not cause an overload, or causes an overload that
only degrades antiboosted streams.
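
Condition (a) can be illustrated with a minimal sketch. It
assumes that each component stream carries a bandwidth and
a signed priority adjustment (positive for boost, negative
for antiboost). The names `Component` and
`boost_is_balanced` are hypothetical and are not part of
the described embodiment. Condition (b) depends on the
overload state at run time and is not modeled here.

```python
from dataclasses import dataclass

@dataclass
class Component:
    bandwidth: float   # average bandwidth of the component stream
    adjustment: float  # assumed encoding: > 0 boosts, < 0 antiboosts

def boost_is_balanced(components):
    """Return True if the bandwidth-weighted average priority of the
    job does not increase, i.e. the weighted adjustments sum to <= 0."""
    weighted = sum(c.bandwidth * c.adjustment for c in components)
    return weighted <= 0

# Low-bandwidth audio boosted, high-bandwidth video antiboosted.
job = [Component(bandwidth=0.2, adjustment=+2.0),
       Component(bandwidth=1.1, adjustment=-1.0)]
print(boost_is_balanced(job))
```

Because the weighting is by bandwidth, a small audio
stream can receive a large boost in exchange for a modest
antiboost of the much larger video stream.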

It may be necessary to prevent an application from
boosting the priority of a scripted component now in
exchange for hurting another component in the future;
otherwise it could cheat by appending dummy components
to the end of a script, boosting to get good service now
in exchange for degradation of the dummy portion that
won't really be used anyway. Two ways to handle this
are (1) an economic solution: charge for the entire
composite, including dummy portions, or (2) pre-verification
or policing of the entire schedule: during no cycle does
the bandwidth-weighted boost exceed the antiboost.
Similarly, an application should not be able to pad
composite streams with dummy streams that exist only to
be degraded. Two ways to handle this are (1) economic:
charge for the entire composite including dummy
portions, or (2) discriminate against higher bandwidth jobs
so that adding a dummy stream and degrading it
maximally boosts the remaining streams only to the priority
they would have obtained had the dummy stream not
been a member of the composite in the first place.

The overall approach of favoring older streams
encourages a client to improve its priority by opening a
stream hours ahead of time, but not starting the actual
data transfer until much later, at which point it could have
amassed sufficient priority to displace clients that had
been running smoothly for the past tens of minutes.

In the ensuing discussion, the following terminology
is used:

Cyclic scheduling: a periodic activity that, given the
service desired by each job for each time period,
schedules the necessary operations based on estimates
of the server's capacity for each resource
and the estimated resource demands of each job.
Good estimates can improve server efficiency, but
the algorithms will dynamically compensate for
overconservative estimates as actual resource
utilization is observed.

Cycle fault: occurs when some activities that are
scheduled for a time period have not completed by
the end of that cycle. These unfinished activities are
terminated or rescheduled according to their job's
disposition policy.

Estimated server slack: with respect to a resource
and a cycle, is the amount by which the estimated
server capacity exceeds the estimated job demands.

As shown in Fig. 2, when a new request arrives 40, 
job admission is performed. Note that it is necessary to 
treat the job, which may be a composite stream or script- 
ed presentation, as the unit of admission and priority. It 
would not be satisfactory to degrade a scripted presen- 
tation late in its show just because it closes one file and 
opens another, therefore giving the appearance of start- 
ing a very young continuous media strand. 

The goal of job admission is quick rejection of new 
jobs when the server is clearly in a state of heavy over- 
load. Otherwise a job is conditionally admitted and the 
cyclic scheduling will attempt to ensure that the job re- 
ceives service only if sufficient server capacity actually 
is available. This strategy differs from the more common 
approach, in which systems must predict resource de- 
mands and capacity very accurately and reject a job if 
the server actually would not be able to serve it. Be- 
cause of the robustness of the present invention's cyclic 
scheduling, a simple algorithm is sufficient for job ad- 
mission. 

Given the lightly loaded resource threshold parameter
X_R, the job multiplicity threshold parameter Y, and
the resource insufficiency threshold parameter Z, a job
J is tested for admission as follows:

* If all current jobs are being served 42 without cycle
faults and the estimated resource consumption for
J is less than X_R% of the estimated server slack for
each resource R, then Trivial Admit 44;

* Else if the current number of admitted jobs exceeds
Y 46, then Trivial Reject 48;

* Else if some conditionally admitted jobs currently
are not being served because of resource
insufficiency for resource R 50, and the sum of their
estimated resource demands for R plus the estimated
demand for R by the new job J exceeds the estimated
server capacity by a factor of Z%, then Trivial
Reject 52;

* Else Conditional Admit 54.
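
The admission test above can be sketched as follows. This
is an illustration under stated assumptions, not the
implementation of the embodiment: resources are dictionary
keys, all estimates use one consistent unit, the default
for the job multiplicity threshold is hypothetical, and
"exceeds the estimated server capacity by a factor of Z%"
is read as "exceeds Z% of the estimated capacity". The
function and parameter names are invented for the example.

```python
def admit(job_demand, slack, capacity, unserved_demand,
          n_admitted, cycle_faults,
          x_pct=25.0, y_max=50, z_pct=100.0):
    """Classify a new job J.

    job_demand[r]      estimated demand of J for resource r
    slack[r]           estimated server slack for r
    capacity[r]        estimated server capacity for r
    unserved_demand[r] summed demand for r of conditionally admitted
                       jobs not served because r is insufficient
    y_max              hypothetical value; the text gives none
    """
    # Trivial admit: no cycle faults and J needs less than X% of
    # the slack on every resource it uses.
    if not cycle_faults and all(
            job_demand[r] < slack[r] * x_pct / 100.0 for r in job_demand):
        return "trivial admit"
    # Trivial reject: too many jobs are already admitted.
    if n_admitted > y_max:
        return "trivial reject"
    # Trivial reject: some resource already leaves admitted jobs
    # unserved, and adding J pushes demand past Z% of capacity.
    for r, unserved in unserved_demand.items():
        total = unserved + job_demand.get(r, 0.0)
        if unserved > 0 and total > capacity[r] * z_pct / 100.0:
            return "trivial reject"
    return "conditional admit"

print(admit({"disk": 0.05}, {"disk": 0.4}, {"disk": 1.0}, {}, 3, False))
```

The test is deliberately cheap. A job that is neither
clearly safe nor clearly hopeless is conditionally
admitted, and the cyclic scheduling decides later whether
it actually receives service.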



The threshold parameters X_R, Y, and Z need empirical
determination. X_R = 25% and Z = 100% are reasonable
starting values. Y is set low enough that scheduling
and disposition processing, which are linear in the
number of jobs, do not dominate the scheduling cycle.

The admission algorithm can be modified in an
alternate embodiment, if it is desirable to have a class of
jobs that are permitted to barge into the server and
displace currently running jobs. In that case "trivial
reject" would be replaced with "repeatedly remove from the
system the least important job until the current job can
be conditionally admitted".

An admitted job is assigned to the first cycle during
which it has service requests. The job's data transfer
requests for the cycle are inserted into a priority queue
for that cycle, ordered by the prioritization policy. For
example purposes, this discussion uses the prioritization
policy discussed earlier, i.e. an older job has higher
priority than a younger job, and the priority of a continuous
media stream belonging to a job is equal to the priority
of the job plus any boosting or antiboosting. When the
transfer for a job completes or is disposed of at cycle
end, the job is assigned to the next period during which
it has service requests.
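
The ordering of a cycle's queue under this example policy
can be sketched as follows. The sketch assumes, purely for
illustration, that a job's priority is its age in seconds
and that boosting or antiboosting is a signed number added
to it; the text does not fix either encoding.
`build_cycle_queue` is a hypothetical name.

```python
import heapq

def build_cycle_queue(requests, now):
    """requests: iterable of (job_id, admit_time, adjustment).
    Returns the job ids ordered from most to least important."""
    heap = []
    for job_id, admit_time, adjustment in requests:
        # Assumed encoding: age in seconds plus boost or antiboost.
        priority = (now - admit_time) + adjustment
        # heapq is a min-heap, so negate to pop the highest first.
        heapq.heappush(heap, (-priority, job_id))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

requests = [("new", 90.0, 0.0),
            ("old", 10.0, 0.0),
            ("mid_boosted", 50.0, 60.0)]
print(build_cycle_queue(requests, now=100.0))
```

Any other policy that yields a rank ordering over the
streams could replace the priority computation without
changing the rest of the queue handling.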

The scheduler needs to know at the beginning of a
cycle how much data is requested, and from which
disks. The transfers requested by a job are conveyed to
the server by a stream description or script that identifies
the specific data frames needed and the times at which
they are needed. This encompasses a variety of styles
of client interaction, ranging from client-driven
frame-by-frame read operations to server-driven periodic data
transmission subject to client commands such as
pause, fast-forward, or rewind. Among the ways to
provide this information are a frame index and desired
frame rate; a file name and frame rate for a data format
known to the server, such as Motion JPEG or MPEG; or
a specification that reserves bandwidth from a particular
disk, and a sequence of read requests (offset and
length) that arrive in a timely fashion so the server can
retrieve the data during the scheduled time frame.

One property of the interface to note is that the
application is not required to state how much of various
resources in various categories it needs to obtain its
desired quality of service. It just states frames per second
and identifies the data to be played. It is the server's
responsibility to assess resource demands and
determine whether the server can handle it.

Each stream that requires frames during a cycle is
scheduled for one transfer during the cycle. The cycle
length is set to balance a tradeoff. On one hand, a short
cycle improves the response time of new streams
starting up. On the other hand, a long cycle improves system
efficiency because overheads are amortized over larger
transfers, although the total required buffer space grows
with increasing cycle length. For the continuous media
transfers of the present invention, the space
requirement is directly related to the cycle length because one
cycle's volume of data is buffered between disk and
network.

For example, take the average data rate of a stream
as R, and w as the fraction of the disk resource that is
to be wasted on access time overhead. For a disk whose
transfer rate is r_d, let t_a and t_t be the expected
average access and transfer time, respectively.

This implies:

$$ w = \frac{t_a}{t_a + t_t} $$

For an average transfer size of B bytes, t_t x r_d = B. But
within a cycle of length T, B = T x R bytes can be
transferred. Therefore, T x R = t_t x r_d.

Leading to:

$$ T = \frac{(1 - w) \times t_a \times r_d}{w \times R} $$

In order to waste less than w = 50% of the disk on access
time overhead, the cycle length needs to be set:

$$ T \ge \frac{t_a \times r_d}{R} $$

For example, consider a SCSI disk with 20 ms average
access time and 3.4 MB/s disk transfer rate, playing
MPEG video streams having an average bandwidth
of 1.12 megabits per second. Then the cycle period
should be 1/2 sec or more.
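
The arithmetic of this example can be checked with a short
calculation. The sketch assumes that w is the ratio of
access time to total disk time, that each stream receives
one transfer of T x R bytes per cycle, and that MB and
megabit mean 10^6 bytes and 10^6 bits. The function name
is hypothetical.

```python
def min_cycle_length(t_a, r_d, R, w):
    """Smallest cycle length T in seconds that keeps the access-time
    overhead at or below the fraction w.
    t_a: average access time (s)
    r_d: disk transfer rate (bytes/s)
    R:   average stream data rate (bytes/s)"""
    return (1.0 - w) * t_a * r_d / (w * R)

t_a = 0.020          # 20 ms average access time
r_d = 3.4e6          # 3.4 MB/s disk transfer rate
R = 1.12e6 / 8       # 1.12 megabits per second, in bytes per second
T = min_cycle_length(t_a, r_d, R, w=0.5)
print(f"T >= {T:.3f} s")   # prints: T >= 0.486 s
```

The result, just under half a second, agrees with the
cycle period of 1/2 sec stated above. At that cycle length
each stream buffers about T x R = 68,000 bytes per cycle.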

Any fast subsystem can use the idle time at the end 
of the cycle to work ahead, bounded by the buffer space 
allocated to each stream. For example, it is desirable for 
the disk subsystem to buffer ahead by 1 second to pro- 
vide robustness in the face of rare disk delays such as 
seek retries (typically, 20 ms.) or recalibration cycles 
(typically 3/4 sec.). 

Issuing multiple concurrent disk reads to a single 
disk enables the operating system's disk scheduling to 
improve the average access time, which supports a 
shorter cycle period. By allocating disk space in track- 
sized blocks, transfers scheduled by known methods 
are so efficient that constrained placement of data on 
disk becomes unimportant. 

The cycle length is directly related to the latency of 
starting a new stream. If the network and disk cycle 
times are 1/2 second, the maximum server-induced la- 
tency for successful startup of a new job is 1 second. 
This is tolerable from a human factors standpoint. For 
one-time transfers (e.g. image retrieval) the delay
implied by a long cycle may be an issue from a human
factors standpoint. For instance, if a user is interactively
traversing hypermedia links, latency less than 200
milliseconds would be more suitable. The real-time
sporadic server research and slack stealing approach suggest
moving one-time jobs ahead in the schedule, slack
permitting, and inserting rescheduling
points within the cycle to reduce startup latency.

The maximum read size is bounded by the bandwidth
per cycle. If reads are large by comparison with
the cycle length, utilization will suffer because a large
slack must accumulate before an additional read can fit
into the schedule. A compensating technique is to utilize
idle slack at the end of each cycle by starting the next
cycle early to work ahead. In this case, a limitation on
buffer space per stream will bound how far ahead that
stream is fetched. Fragmenting requests into smaller
pieces can improve schedulability, but adds overhead.

At the beginning of a cycle, the transfer requests in
the queue are processed by hierarchical resource
scheduling. The resource hierarchy consists of the
memory bandwidth, the bandwidth for each disk
controller and network controller, and each disk drive's
bandwidth, forming the resource tree depicted in Fig. 3.

Hierarchical resource scheduling for a cycle is a
culling out process. First the memory bandwidth is
checked, and the lowest priority transfer requests are
removed until the total bandwidth demand is within the
resource limit. The list of transfer requests passing this
filter is split into lists for each disk controller and network
interface. These lists are filtered by the respective
bandwidths of those busses and controllers. Then the list for
a disk controller is split into lists for each disk, which are
again culled out based on a conservative estimate of the
disk access and transfer times for each request.
Transfers that have not been culled out by the above process
are said to have been admitted to the current cycle.
These resource filters are conservative so that jobs not
culled out will complete without cycle fault.
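
The culling process can be sketched as a sequence of
filters over a resource tree. The sketch makes several
simplifying assumptions that are not in the text: each
level is limited by a per-cycle byte budget rather than a
time estimate, all controllers share one limit and all
disks share another, and network interfaces are omitted.
All names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Transfer:
    job: str
    priority: float   # assumed encoding: higher is more important
    nbytes: int
    controller: str
    disk: str

def cull(transfers, limit):
    """Remove lowest-priority transfers until the total fits the limit."""
    kept = sorted(transfers, key=lambda t: t.priority, reverse=True)
    while kept and sum(t.nbytes for t in kept) > limit:
        kept.pop()    # the last element has the lowest priority
    return kept

def split(transfers, key):
    """Group transfers by the resource that key() selects."""
    groups = defaultdict(list)
    for t in transfers:
        groups[key(t)].append(t)
    return groups

def admit_to_cycle(transfers, mem_limit, ctrl_limit, disk_limit):
    """Filter by memory, then per controller, then per disk."""
    admitted = []
    after_mem = cull(transfers, mem_limit)
    for ctrl_list in split(after_mem, lambda t: t.controller).values():
        after_ctrl = cull(ctrl_list, ctrl_limit)
        for disk_list in split(after_ctrl, lambda t: t.disk).values():
            admitted.extend(cull(disk_list, disk_limit))
    return admitted

transfers = [Transfer("a", 3, 40, "c0", "d0"),
             Transfer("b", 2, 40, "c0", "d0"),
             Transfer("c", 1, 40, "c0", "d1")]
print([t.job for t in admit_to_cycle(transfers, 120, 100, 50)])
```

In this run the controller filter removes the least
important transfer and the disk filter removes one more,
so only the most important transfer is admitted.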

Other important resources are CPU cycles, buffer
space, and disk space. As discussed previously, the
CPU is assumed not to be the bottleneck, so beyond the
use of a real-time priority class for continuous media
processes, no CPU management is performed. Buffer
space is managed by allocating a pool per stream,
granted at job admission (not cycle admission). Disk
space is allocated at job admission for jobs that perform
recording. Bandwidth for any streams that are recording
must be obtained from the scheduler. Resources must
also be reserved to handle control messages from the
clients so that streams are suitably responsive to
commands such as skip, pause, abort, etc. Additional time
in the cycle, proportional to the number of items in that
cycle's job queue, is reserved for scheduling and
disposition processing.

For simplicity, the preferred embodiment is present- 
ed as if each job has only one stream. Multiple streams 
for one job are scheduled independently during a cycle, 
but they are treated as a single entity at the end of the
cycle for disposition and scheduling of the next cycle's 
transfers. 

The method to schedule jobs at the beginning of a
cycle is described in terms of resource consumption
functions. This generates admit list or wait queue entries
on all resources for all jobs that want service during this
cycle. For a cycle of length T, the initial batch of k
admitted jobs has the property that the time predicted by
all resource consumption functions to serve all jobs is
less than T.

For job J, the consumption functions are denoted
C_R(J, L_R), where resource R is one of the following:
memory read bandwidth (memr), memory write
bandwidth (memw), a network controller's bandwidth (n), a
disk controller's bandwidth (c), or a disk's bandwidth (d).
L_R is a list of jobs already admitted to this cycle for
resource R. A consumption function returns a
conservative estimate (upper bound) of the incremental time
added by job J to a list of jobs L_R previously admitted to the
resource R in a given cycle. Define C_R(nil, L_R) to be the
total estimated time for all jobs in L_R and C_R(J, empty)
to be the time estimated for job J alone. The resource
consumption functions must be conservative so that the
algorithm only admits jobs that can in fact be served by
that resource during the cycle. To make the equations
operate correctly, C_R(J, L_R) should be commutative and
monotonic with respect to L_R.

The following description, as depicted in Fig. 4,
involves cycle time accumulators denoted Admitted_R for
each resource R listed above, except memory read and
write are aggregated into one resource, denoted mem.
The accumulators contain the total estimated time of
jobs admitted to each resource for this cycle. A resource
saturation flag, denoted Saturated_R, becomes TRUE
when the algorithm refuses admission to that resource
for a job request this cycle.

1. Initialize all the cycle time accumulators to 0 and
all the saturation flags to FALSE (60).

2. For each job J in the priority queue of jobs for this
cycle (62, 64), in decreasing order of importance, do
the following steps. Let

n denote the network controller that will transmit
the data for job J,

c denote the disk controller that will retrieve the
data for job J, and

d denote the disk that will retrieve the data for
job J.

(a) Let M = C_memw(J, L_memw) + C_memr(J, L_memr).
If T - Admitted_mem < M (66) then append job J to
the wait queue for the memory subsystem (68);
otherwise job J is admitted for the memory
subsystem (70): Admitted_mem += M; L_memr += J;
L_memw += J.

(b) If Saturated_n or if T - Admitted_n < C_n(J, L_n)
(72) then set Saturated_n = TRUE and append job
J to the wait queue for network controller n (74);
otherwise job J is admitted for network controller
n (76): Admitted_n += C_n(J, L_n); L_n += J.

(c) If Saturated_c or if T - Admitted_c < C_c(J, L_c)
(78) then set Saturated_c = TRUE and append job
J to the wait queue for disk controller c (80);
otherwise job J is admitted for disk controller c (82):
Admitted_c += C_c(J, L_c); L_c += J.

(d) If Saturated_d or if T - Admitted_d < C_d(J, L_d)
(84) then set Saturated_d = TRUE and append job
J to the wait queue for disk d (86); otherwise job
J is admitted to the ready queue for disk d (88):
Admitted_d += C_d(J, L_d); L_d += J.

(e) If job J has been admitted for memory, disk
controller, and disk, then add it to the initial
batch of disk requests (90).
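
As a non-limiting illustration, the cycle-start admission of steps 1 and 2 may be sketched in Python as follows. The Job record, the linear_cost helper, the dictionary-based bookkeeping, and the resource keys are assumptions introduced for this sketch only; any conservative consumption function could be substituted.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    nbytes: int
    net: str   # network controller that will transmit the data
    ctrl: str  # disk controller that will retrieve the data
    disk: str  # disk that will retrieve the data


def linear_cost(bandwidth, overhead=0.0):
    """Assumed consumption function: seconds added by one job to a resource."""
    return lambda job, admitted: job.nbytes / bandwidth + overhead


def schedule_cycle(jobs, T, cost):
    """Admit jobs for one cycle of length T.

    jobs must already be ordered by decreasing importance.
    cost maps "memr", "memw", "n", "c", "d" to consumption functions.
    """
    admitted_time = defaultdict(float)  # Admitted_R, initially 0
    admitted = defaultdict(list)        # L_R
    saturated = defaultdict(bool)       # Saturated_R, initially FALSE
    wait = defaultdict(list)            # per-resource wait queues
    initial_batch = []

    def admit(res, kind, job):
        # Steps (b)-(d): once a resource refuses a job it stays saturated,
        # so a less important job cannot overtake a more important one.
        need = cost[kind](job, admitted[res])
        if saturated[res] or T - admitted_time[res] < need:
            saturated[res] = True
            wait[res].append(job)
            return False
        admitted_time[res] += need
        admitted[res].append(job)
        return True

    for job in jobs:
        # Step (a): memory read and write share one accumulator.
        m = (cost["memw"](job, admitted["memw"])
             + cost["memr"](job, admitted["memr"]))
        mem_ok = T - admitted_time["mem"] >= m
        if mem_ok:
            admitted_time["mem"] += m
            admitted["memr"].append(job)
            admitted["memw"].append(job)
        else:
            wait["mem"].append(job)
        admit(("n", job.net), "n", job)             # step (b)
        ctrl_ok = admit(("c", job.ctrl), "c", job)  # step (c)
        disk_ok = admit(("d", job.disk), "d", job)  # step (d)
        if mem_ok and ctrl_ok and disk_ok:          # step (e)
            initial_batch.append(job)
    return initial_batch, wait
```

In this sketch every resource is examined for every job, so a job refused by one resource still holds its reservation or wait queue entry on the others, consistent with the ordering property discussed below.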

Note that all resources for a job are reserved (or an
entry is put on the appropriate wait queue) before the next
less important job is considered. This is to maintain the 
property that a less important job will not be handled if 
a more important one could have been. It is quite pos- 
sible that a job that cannot fit in the initial batch of disk 
requests will subsequently be able to be scheduled into 
disk slack time during the execution of the cycle. There- 
fore the scheduling is designed to prevent less important 
jobs from saturating the schedule for the network or 
memory subsystems. 

When the cycle begins, all the disk requests in the 
initial batch are issued to the operating system. There
is a tradeoff involving the number of outstanding disk
I/Os in the kernel. More outstanding I/Os should produce
a more efficient schedule, but they incur a loss of control
over the response time to any particular request. The
initial batch size is the largest that can be reliably pre- 
dicted by the resource consumption functions to com- 
plete all its transfers in time to meet the application 
deadlines. 

Given large transfers, the disk bus and controller 
may be more of a bottleneck than the disks. For in- 
stance, a SCSI bus can handle seven or more devices, 
but a SCSI bus and controller typically saturate at the 
bandwidth of two fast disks performing large transfers. 
So this may make it unnecessary to have a good disk 
model; the bottleneck may be the SCSI controller. 

As a first cut at simple consumption functions, the
simple linear approximation C_R(J, L_R) = bytes(J) /
bandwidth_R + overhead_R is used. Conservative values for
the bandwidths and overheads for a particular system can
be determined by running small programs that exercise
and measure the resources. In particular, the sustained
transfer bandwidth as a function of the number of bytes
is measured, and the overhead for memory is 0, the
overhead for a disk controller is the response time of a
1-byte read that hits in the on-disk track cache, the
overhead for a network is the time to read or write a 1-byte
message, and the overhead for a disk is the full stroke
seek plus maximum rotational latency.
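
By way of example only, the linear approximation may be sketched in Python as follows. Taking the lowest measured bandwidth and the highest measured overhead is one assumed way of obtaining conservative parameters; the function names and the sample figures are hypothetical and are not taken from any measured system.

```python
def conservative_parameters(bandwidth_samples, overhead_samples):
    """Pick pessimistic parameters from measurements (assumed policy).

    bandwidth_samples: sustained transfer rates in bytes per second.
    overhead_samples: fixed per-request delays in seconds.
    """
    return min(bandwidth_samples), max(overhead_samples)


def consumption(nbytes, bandwidth, overhead):
    """Linear estimate of the time one transfer adds to a resource."""
    return nbytes / bandwidth + overhead


# Hypothetical disk measurements: three bandwidth runs, two seek timings.
disk_bandwidth, disk_overhead = conservative_parameters(
    [5.2e6, 4.0e6, 4.8e6], [0.018, 0.025]
)
estimate = consumption(2_000_000, disk_bandwidth, disk_overhead)
```

Because the estimate is an upper bound, any transfer that completes faster than predicted releases slack time during the cycle.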

Transfers that are not admitted at the beginning of 
a cycle may still be served during that cycle. When a 
disk or net transfer finishes, the actual resource con- 
sumption, based on completion time, is compared with 
the predictions at each level of the resource hierarchy. 
Since the predictions are conservative, the actual times 
should be smaller. In this case slack time has been re- 
leased, so the highest priority transfer that has not yet 
been admitted is checked to see whether it can now be 
admitted to the current cycle. This process is termed
"slack filling".

Conservative admission at the beginning of a cycle, 
coupled with slack filling to serve additional streams, au- 
tomatically provides robustness. If streams start slipping 
their schedules for any reason, whether the fault of the 
server or a result of some external trouble, less slack 
will become available during the cycle, effectively de- 
grading the "least worthy" streams (according to the pri- 
oritization policy as discussed above) to preserve prop- 
er service to as many streams as can be handled by the 
available resources. As an example, ideally, a 10% loss 
of resources on a fully loaded machine would only im-
pact 10% of the clients; the other 90% would be unaware
of the problem. Furthermore, effective streams are se- 
lected for degradation. The shortage of slack from a par- 
ticular resource, e.g. an overloaded disk drive, will block 
the least important transfers that use that resource, with- 
out impairing transfers that are independent of that re- 
source. The more specific and accurate the resource 
models are, the better the confinement of effects of re- 
source shortages. This protection works even if job ad- 
mission erroneously admits more jobs than can be 
served within the resource bounds. 

Because of the slack filling during the disk and net- 
work cycles, potential cycle overruns can be detected 
before transfers are actually late. At the start of a cycle, 
the jobs not in the initial batch are at risk of missing their 
deadlines. Jobs further down the wait list are at greater 
risk, quantified by the bandwidth requirement stated in 
each job's descriptor. Compensating mechanisms can 
engage before any stream degradation is observed at 
the client side. This straightforward support for perform- 
ance surveillance is a valuable feature. In a highly un- 
predictable environment another step is taken as an en- 
hancement to the preferred embodiment. The progress 
of transfers during the cycle is monitored and if a severe 
lack of progress is observed, victims are chosen to be 
aborted from among the least important transfers al- 
ready in progress. 

The slack-filling technique is described in the con- 
text of copy through memory architectures (the server 
reads from disk to buffer, sends from buffer to network). 
It will also work with network-attached disks and desk 
area network organizations that use high bandwidth net- 
working such as a fibre channel, and that may support 
bandwidth allocation. If the network is not a bottleneck,
transfer scheduling is unnecessary for the network side.
The server schedules the disk I/Os and data from disk
is immediately dumped through the network channel to
the client.

During a cycle, whenever a disk read finishes, the
filled data buffer is handed off to the network manager
for transmission in accordance with the network schedule.
The network manager can dynamically estimate the
desired bandwidth associated with this buffer as the ratio
of buffer size to remaining time in this network cycle.
The difference between the conservative predicted disk
read time and the actual completion time is additional
slack that can be released to the disk, disk controller,
and memory write schedules. Given additional slack,
each resource attempts to admit additional high priority
jobs from the resource wait queue. If a job becomes
admitted to disk, disk controller, memory read, and memory
write, its disk request is issued to the operating system.

The method to calculate the slack that is freed in
one disk's schedule by the completion of a read uses
four variables:

Admitted: The list of requests for this disk that have
been admitted to this cycle;
Completed: The list of requests for this disk that
have completed this cycle;
Slack: The accumulated available slack for this
cycle; and
Prev_done: The time at which the previous request
completed.

At the start of a cycle, the variables are initialized to:

Admitted = L_d
Completed = empty
Slack = T - C_d(nil, Admitted)
Prev_done = current_time

When the request for job J completes, the following
steps are performed:

1. Calculate the slack released by this job and update
the variables. The slack is the time it was predicted
to add to the partial schedule, minus the time
it actually added.

(a) Slack += C_d(J, Completed) - (current_time
- Prev_done)
(b) Prev_done = current_time
(c) Completed = Completed + J

2. Check admission for the highest priority job J' in
the disk's wait queue. If C_d(J', Admitted) <= Slack
then

(a) Slack -= C_d(J', Admitted)
(b) Admitted += J'
(c) If J' is not on wait lists for disk controller or
memory then issue the disk read to the operating
system.
(d) Iterate step 2 to check the next highest
priority job.
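
The slack calculation above may be illustrated, purely as a sketch, by the following Python class for a single disk. The class name, the cost(job, jobs) callable, and the representation of the wait queue as a list ordered by decreasing priority are assumptions of this sketch. The check of step 2(c) against the disk controller and memory wait lists is left out, so the method returns the newly admitted jobs rather than issuing reads.

```python
class DiskSlack:
    """Slack bookkeeping for one disk during one cycle (illustrative only)."""

    def __init__(self, T, admitted, wait_queue, cost, now):
        self.cost = cost                    # cost(job, jobs) -> seconds (assumed)
        self.admitted = list(admitted)      # Admitted = L_d
        self.completed = []                 # Completed = empty
        self.wait_queue = list(wait_queue)  # highest priority first
        total, seen = 0.0, []
        for job in self.admitted:           # C_d(nil, Admitted) as a running sum
            total += cost(job, seen)
            seen.append(job)
        self.slack = T - total              # Slack = T - C_d(nil, Admitted)
        self.prev_done = now                # Prev_done = current_time

    def on_complete(self, job, now):
        """Apply steps 1 and 2 when the read for `job` finishes at time `now`."""
        # Step 1: predicted time minus the time actually consumed.
        self.slack += self.cost(job, self.completed) - (now - self.prev_done)
        self.prev_done = now
        self.completed.append(job)
        # Step 2: admit waiting jobs in priority order while they fit.
        newly_admitted = []
        while self.wait_queue:
            need = self.cost(self.wait_queue[0], self.admitted)
            if need > self.slack:
                break                       # head of queue does not fit; stop
            nxt = self.wait_queue.pop(0)
            self.slack -= need
            self.admitted.append(nxt)
            newly_admitted.append(nxt)
        return newly_admitted
```

The loop stops at the first waiting job that does not fit, so a less important job is never admitted ahead of a more important one on the same disk.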

When the disk cycle finishes early because all the 
transfers have completed, the next cycle is scheduled 
without delay. The new cycle now has extra time by 
starting immediately but ending at the proper time to 
maintain the correct pacing of the periodic schedule. Al- 
though the cycle can start early, the amount of work 
ahead is constrained by buffer limits. With respect to 
network work ahead, ready buffers that belong to future 
cycles have lower network scheduling priority than any 
buffer belonging to this cycle. This prevents work ahead 
from preempting the slack backfilling that enlarges the 
number of streams the server can concurrently sustain. 
In essence, the network does not work ahead, although
as with the disk, the network may start its next cycle
prematurely if it runs out of work for the current cycle.

The network controller has a priority queue of trans- 
fers for this cycle. This cycle has the same duration as 
the disk cycle, but is phase shifted. The minimum phase 
shift is the time actually needed to complete the first disk 
read for this cycle. The maximum phase shift is the full 
cycle duration. This is what it takes to finish all disk reads 
then hand them over as a complete batch to the network 
scheduler. A small phase difference implies a small 
server query/response latency, but a large phase differ- 
ence is good for network bandwidth because a larger 
number of buffers are available for concurrent transmis- 
sion, so many channels can be serviced between calls. 
If the first buffer for a cycle arrives late, the phase dif- 
ference is increased to allow a full cycle period for the 
network transmissions. If a network cycle finishes all its 
scheduled transmissions before the deadline, and if a 
buffer for the next network cycle is available for trans- 
mission, the phase difference will be decreased, but on- 
ly by a small amount. That is, transmission for the next 
network cycle will commence immediately to avoid 
wasting network bandwidth, but the end time of the next 
network cycle will only be advanced by a small amount 
to protect the network bandwidth. 

Because the disk and network cycles do not exactly
coincide, the simple memory bandwidth check described
earlier may need refinement in an alternate embodiment.
This is unimportant if the memory bandwidth
is not the system bottleneck, or if the load from cycle to
cycle is relatively constant so coincident load on the
memory from the previous network cycle and the current
overlapping disk cycle does not exceed the memory
bandwidth. If the network is relatively fast then, on a
server with copy-through-memory architecture having
many disks, the memory bandwidth will likely be a
bottleneck. In this case it will be necessary to schedule and
allocate memory bandwidth each time a new network or
disk cycle begins, by reserving a portion for system
overhead and allocating the remainder to serve the
network and disk traffic, favoring whichever of the two is
more of a system bottleneck.

Disk throughput has been improved by issuing an
"initial batch" of requests to a disk so I/O scheduling in
the operating system could reduce the seek overhead.
An analogous technique applies to network transmission
from server to client. The "initial batch" of k jobs
from the priority queue is the largest set of most important
jobs that have aggregate bandwidth less than X%
of the recent observed network bandwidth, where X is
a tunable parameter set to 100% for reservation networks,
and less than 100% to provide a safety factor for
networks such as Ethernet that can suffer sudden loss
of bandwidth because of congestion.

The network "inner loop" is (1) determine the set of
communication channels that can accept an asynchronous
write without blocking, and (2) if any job in the
initial batch can accept an asynchronous write without
blocking then do an asynchronous write for every such
job in the initial batch, else do an asynchronous write for
the first such job in the priority queue. But a write is
initiated only if, based on recently observed network
bandwidth, it is predicted that (1) the write will complete by
the end time of this network cycle, and (2) if the write is
not from the initial batch, it will leave enough bandwidth
in the cycle to transmit the remaining data for the initial
batch.
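
The selection of the network initial batch and the two conditions on initiating a write may be sketched in Python as follows. This is one possible reading under stated assumptions: jobs are (name, bandwidth) pairs already in priority order, the batch is taken as the longest prefix of the queue, and both predictions simply divide the outstanding bytes by the recently observed bandwidth. All function and parameter names are hypothetical.

```python
def network_initial_batch(jobs, observed_bandwidth, x_percent):
    """Return the first k job names whose aggregate bandwidth is below X%.

    jobs: list of (name, required_bandwidth), most important first.
    """
    budget = observed_bandwidth * x_percent / 100.0
    batch, total = [], 0.0
    for name, bandwidth in jobs:
        if total + bandwidth >= budget:  # aggregate must stay below X%
            break
        batch.append(name)
        total += bandwidth
    return batch


def may_start_write(write_bytes, in_initial_batch, batch_bytes_left,
                    observed_bandwidth, time_left):
    """Decide whether an asynchronous write should be initiated now."""
    # Condition (1): the write is predicted to finish within this cycle.
    if write_bytes / observed_bandwidth > time_left:
        return False
    # Condition (2): a write from outside the initial batch must leave
    # enough of the cycle to send the batch's remaining data as well.
    if not in_initial_batch:
        needed = (write_bytes + batch_bytes_left) / observed_bandwidth
        return needed <= time_left
    return True
```

With X set below 100, the budget shrinks and fewer jobs enter the initial batch, which corresponds to the safety factor described for networks such as Ethernet.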

Slack freeing when a network transfer completes is
similar to that described in the disk case above. Slack
is freed down the resource tree, admitting additional
jobs from resource wait lists when possible. If a job
becomes admitted everywhere, its disk request is issued to
the operating system. When a job completes all its
network transfers for a cycle, if the job has transfers for any
future cycles, it is placed into the job queue of the
nearest such cycle.

As described above, the initial batch of jobs for
concurrent service at the beginning of a network cycle is the
first k jobs on the priority queue, such that their aggregate
bandwidth demand is safely within the available
network bandwidth. For networks like Ethernet that can
suffer sudden and prolonged loss of bandwidth, it may
be necessary to initially focus on a smaller number of
the most important jobs, and only start to serve others
as it becomes increasingly likely that the first jobs will
finish before the cycle end. In such a case, it is desirable
to derate k, and then start serving additional jobs as the
first k make progress, but before any of the first k
completes. In this case, k is an adaptive function of the
following four parameters:

1. The bandwidth required to complete the transmission
of the initial batch by the end of this cycle.

2. The recent observed network throughput.

3. The derating factor that reduces to an acceptable
level the potential impact of future bandwidth loss,
should it occur during the remainder of the cycle.

4. The amount of time remaining in the cycle.

A disk or network cycle is terminated when all
outstanding jobs have completed and the slack is
insufficient to issue the next job in the priority queue, or when
the duration of the cycle time has been reached.

The scheduling is intended to prevent the second 
case from occurring. If it should happen anyway, e.g. if 
a disk goes into a recalibration cycle for 3/4 seconds, so 
transfers that were issued have not completed by cycle 
end, then the disposition policies are applied to all trans- 
fers that did not complete. 

To support the canceling of asynchronous network 
transmissions, the bulk data of a transmission is frag- 
mented. Each fixed-length fragment has a small header 
that conveys indications to the client such as abandon 
the current in-progress transfer, or this transfer is slip- 
ping and will finish later than scheduled. Because some 
asynchronous mechanisms, e.g. in UNIX, will complete 
a read or write with a partial packet, it is possible to re- 
quest the transmission of a large block with scatter/gath- 
er to insert small fragment headers indicating proper 
completion, and between asynchronous reads/writes 
change plans and alter the next fragment header to in- 
dicate a drop or slip. This obtains the approximate 
throughput of large block transmissions in the normal 
case, but short latency that is proportional to the frag- 
ment size, in case of abandonment. 
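
The fragment headers may be illustrated by the following Python sketch. The three-byte header layout, the status codes, and the status_for callback are hypothetical and are chosen only to show how the plan can change between fragments; the description above does not fix a wire format.

```python
import struct
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical indications carried in each fragment header."""
    OK = 0       # fragment is part of a transfer proceeding on schedule
    SLIP = 1     # transfer is slipping and will finish later than scheduled
    ABANDON = 2  # client should abandon the in-progress transfer


# Assumed header: 1-byte status, 2-byte payload length, network byte order.
HEADER = struct.Struct("!BH")


def fragments(data, frag_size, status_for):
    """Yield header-prefixed fragments of `data`.

    status_for(offset) is consulted before each fragment, so the sender
    can change plans between asynchronous writes. frag_size must not
    exceed 65535 because of the 2-byte length field.
    """
    for offset in range(0, len(data), frag_size):
        status = status_for(offset)
        if status == Status.ABANDON:
            # Send a header-only fragment and stop: the latency of the
            # cancellation is bounded by one fragment.
            yield HEADER.pack(status, 0)
            return
        chunk = data[offset:offset + frag_size]
        yield HEADER.pack(status, len(chunk)) + chunk
```

In practice the headers and payload pieces could be handed to a single scatter/gather write, so that the normal case keeps the throughput of a large block transmission.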

Canceling of in-progress transfers may also be 
needed if the prioritization policy contains a category of 
high priority jobs that are permitted to barge in during a 
cycle, displacing some currently scheduled and running 
transfers. 

Each job that has not completed at the end of the 
cycle, i.e., remains on a wait queue or is canceled while 
in progress, has its disposition policy applied. Examples 
of disposition policies include: 

Pause the schedule and just slip the entire transfer
schedule for this stream into the future.

Try to catch up by transmitting both this cycle's and
the next cycle's frames during the next cycle.

Drop frames to maintain the presentation time
schedule.

Reduce the frame rate to maintain the presentation
time schedule.

Reduce the frame resolution to maintain the
presentation time schedule.

Kill the job, which stated that it is unwilling to adapt
to resource reduction.

The disposition policy for a stream can change. For 
example, a stream having the "drop frames" policy dur- 
ing normal display may change to "pause the schedule" 
if the application does the "pause" VCR operation. Then 
the server and client buffers for that stream can continue 
to fill up as work ahead, until buffer limits apply back 
pressure. 

Although the subject invention has been described
with respect to preferred embodiments, it will be readily
apparent to those having ordinary skill in the art to which
it appertains that changes and modifications may be
made thereto without departing from the scope of the
subject invention as defined by the appended claims.



Claims 



1. A method of playing back continuous media comprising
the steps of:

performing job admission based on loading of
a server;

performing hierarchical scheduling of a plurality
of resources based on an estimate of consumption
of a plurality of data streams from each of
said plurality of resources with respect to available
time during a single schedule cycle of a
plurality of schedule cycles substantially ensuring
completion of at least one of said plurality
of data streams within a particular time frame
of said single schedule cycle;

improving utilization of said plurality of resources
by slack filling additional available time during
said particular time frame of said single
schedule cycle; and

degrading a first part of one of said plurality of
data streams according to application specified
disposition policies with respect to a second
part of one of said plurality of data streams
deemed less important according to a specified
prioritization policy of said server.

2. The method of claim 1 wherein said job admission
further comprises the steps of:

trivially admitting said at least one of said plurality
of data streams when said server is underloaded;

trivially rejecting said at least one of said plurality
of data streams when said server is overloaded; and

conditionally admitting said at least one of said
plurality of data streams when said server is in
a normal load condition.

3. The method of claim 1 wherein said hierarchical
scheduling further comprises the step of generating
a first batch of data streams from said plurality of
data streams that are to complete within said time
frame of said single schedule cycle.

4. The method of claim 3 wherein said hierarchical
scheduling further comprises the step of generating
a plurality of waiting queues on said resources of
said plurality of data streams that are not members
of said first batch of data streams.

5. The method of claim 4 wherein said improving
utilization of said plurality of resources by slack filling
further comprises the steps of comparing actual
resource consumption with predictions of said
hierarchical resource scheduling and conditionally
admitting a second batch of data streams from said
plurality of waiting queues when said actual resource
consumption is smaller.

6. The method of claim 2 wherein the step of trivially
admitting said at least one data stream further
comprises the steps of estimating the resource
consumption and determining that said at least one
data stream will fit within said slack.

7. The method of claim 1 wherein said plurality of
resources includes a memory subsystem.

8. The method of claim 1 wherein said plurality of
resources includes a network controller.

9. The method of claim 1 wherein said plurality of
resources includes a disk controller.

10. The method of claim 1 wherein said plurality of
resources includes a disk.

11. The method of claim 1 wherein said performing
hierarchical scheduling is done based on a plurality
of network and disk cycles.

12. The method of claim 11 wherein said plurality of
cycles includes a first cycle and a second cycle, said
second cycle having extra time when said first cycle
finishes early.

13. The method of claim 1 wherein said specified
prioritization policy increases protection from
degradation of said at least one of said plurality of data
streams in proportion to said plurality of schedule
cycles of which said at least one of said plurality of
data streams has been served.

14. A method of playing back continuous media comprising
the steps of:

performing hierarchical scheduling of a plurality
of resources based on estimating the resource
consumption and determining that at least one
data stream will fit within a schedule cycle
among a collection of data streams for each of
said plurality of resources;

improving utilization of said plurality of resources
by comparing actual resource consumption
with predictions of said hierarchical resource
scheduling and serving additional data streams
from said collection of data streams from a
plurality of waiting queues when said actual
resource consumption is smaller as a means of
slack filling any available time during said
schedule cycle; and

degrading parts of said at least one data stream
according to an application specified disposition
policy.

15. An apparatus to play back continuous media
comprising:

schedule means for performing hierarchical
scheduling of a plurality of resources based on
estimating resource consumption for determining
that at least one data stream will fit within a
schedule cycle of a plurality of schedule cycles
among a collection of data streams for each of
said plurality of resources;

utilization improvement means of said plurality
of resources for comparing actual resource
consumption with predictions of said hierarchical
resource scheduling and serving additional
data streams from said collection of data
streams from a plurality of waiting queues when
said actual resource consumption is smaller as
a means of slack filling any available time during
said schedule cycle; and

degradation means for degrading parts of said
at least one data stream not fully served
according to an application specified disposition
policy.

16. The apparatus of claim 15 wherein said specified
prioritization policy increases protection from
degradation of said at least one data stream in proportion
to said plurality of schedule cycles of which said
at least one data stream has been served.